🧩 The Ultimate LLM Benchmark Collection

A collection of live benchmarks worth opening every time a new model ships, plus the ones you no longer need to spend time on. (A small helper for opening a release-day watchlist is sketched after the first list.)

🌐 General (multi-skill) leaderboards
SimpleBench — https://simple-bench.com/index.html

SOLO‑Bench — https://github.com/jd-3d/SOLOBench

AidanBench — https://aidanbench.com

SEAL by Scale (MultiChallenge) — https://scale.com/leaderboard

LMArena (Style Control) — https://beta.lmarena.ai/leaderboard

LiveBench — https://livebench.ai

ARC‑AGI — https://arcprize.org/leaderboard

Thematic Generalization (Lech Mazur) — https://github.com/lechmazur/generalization

More benchmarks by Lech Mazur:

Elimination Game — https://github.com/lechmazur/elimination_game

Confabulations — https://github.com/lechmazur/confabulations

EQBench (Longform Writing) — https://eqbench.com

Fiction‑Live Bench — https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87

MC-Bench (sort by win rate) — https://mcbench.ai/leaderboard

TrackingAI – IQ Bench — https://trackingai.org/home

Dubesor LLM Board — https://dubesor.de/benchtable.html

Balrog‑AI — https://balrogai.com

Misguided Attention — https://github.com/cpldcpu/MisguidedAttention

Snake‑Bench — https://snakebench.com

SmolAgents LLM (for GAIA & SimpleQA) — https://huggingface.co/spaces/smolagents/smolagents-leaderboard

Context‑Arena (MRCR, Graphwalks) — https://contextarena.ai

OpenCompass — https://rank.opencompass.org.cn/home

HHEM (Hallucination) — https://huggingface.co/spaces/vectara/leaderboard
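
Since the whole point of this list is to re-open the same leaderboards on every release day, a tiny throwaway script can save a few clicks. This is a minimal sketch, not part of any project above: the hypothetical `release_day.py` below just opens an arbitrary subset of the links from this section in your default browser via the standard-library `webbrowser` module.

```python
# release_day.py - open a personal watchlist of leaderboards in browser tabs.
# Minimal sketch; the selection is an arbitrary subset of the links above.
import time
import webbrowser

WATCHLIST = {
    "LiveBench": "https://livebench.ai",
    "LMArena (Style Control)": "https://beta.lmarena.ai/leaderboard",
    "ARC-AGI": "https://arcprize.org/leaderboard",
    "SimpleBench": "https://simple-bench.com/index.html",
    "Context-Arena": "https://contextarena.ai",
    "HHEM (Hallucination)": "https://huggingface.co/spaces/vectara/leaderboard",
}

if __name__ == "__main__":
    for name, url in WATCHLIST.items():
        print(f"opening {name}: {url}")
        webbrowser.open_new_tab(url)
        time.sleep(0.5)  # brief pause so tabs open in a predictable order
```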

🛠️ Coding / Math / Agentic
Aider‑Polyglot‑Coding — https://aider.chat/docs/leaderboards/

BigCodeBench — https://bigcode-bench.github.io

WebDev‑Arena — https://web.lmarena.ai/leaderboard

WeirdML — https://htihle.github.io/weirdml.html

Symflower Coding Eval v1.0 — https://symflower.com/en/company/blog/2025/dev-quality-eval-v1.0-anthropic-s-claude-3.7-sonnet-is-the-king-with-help-and-deepseek-r1-disappoints/

PHYBench — https://phybench-official.github.io/phybench-demo/

MathArena — https://matharena.ai

Galileo Agent Leaderboard — https://huggingface.co/spaces/galileo-ai/agent-leaderboard

XLANG Agent Arena — https://arena.xlang.ai/leaderboard

🚀 For tracking AI take-off
METR Long-Task Benchmarks (incl. RE Bench) — https://metr.org

PaperBench — https://openai.com/index/paperbench/

SWE‑Lancer — https://openai.com/index/swe-lancer/

MLE‑Bench — https://github.com/openai/mle-bench

SWE‑Bench — https://swebench.com
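
SWE-Bench is also distributed as a Hugging Face dataset, so you can inspect the actual task instances locally instead of trusting headline numbers. A minimal sketch, assuming the `datasets` library and the commonly used `princeton-nlp/SWE-bench_Verified` dataset id (check the SWE-Bench site or the dataset card for the exact id, splits, and field names):

```python
# Peek at SWE-Bench tasks locally. Assumes: pip install datasets
# and that the dataset id below is still the published one.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(f"{len(ds)} task instances")
print("columns:", ds.column_names)

# Each instance pairs a real GitHub issue with the gold patch that resolved it.
example = ds[0]
print(example["instance_id"], example["repo"])
print(example["problem_statement"][:300], "...")
```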

🏆 The must-have "classic" set
GPQA‑Diamond — https://github.com/idavidrein/gpqa

SimpleQA — https://openai.com/index/introducing-simpleqa/

Tau‑Bench — https://github.com/sierra-research/tau-bench

SciCode — https://github.com/scicode-bench/SciCode

MMMU — https://mmmu-benchmark.github.io/#leaderboard

Humanity's Last Exam (HLE) — https://github.com/centerforaisafety/hle

🔍 Classic eval suites and aggregators

Simple‑Evals — https://github.com/openai/simple-evals

Vellum AI Leaderboard — https://vellum.ai/llm-leaderboard

Artificial Analysis — https://artificialanalysis.ai

⚠️ "Overheated" metrics you can safely skip
MMLU, HumanEval, BBH, DROP, MGSM

Most purely mathematical datasets: GSM8K, MATH, AIME, ...

Models are already close to the ceiling on these, so they carry little signal anymore.
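
Saturation is easy to verify yourself: a crude exact-match harness over GSM8K is enough to see frontier models clustering near the top of the scale. A minimal sketch, assuming the `gsm8k` dataset on Hugging Face (config `main`, gold answers after the `####` marker); `ask_model` is a hypothetical placeholder you would wire to your own client:

```python
# Rough GSM8K exact-match check, illustrating why the benchmark is saturated.
# Assumes: pip install datasets. ask_model() is a placeholder, not a real API.
import re
from datasets import load_dataset


def ask_model(question: str) -> str:
    # Placeholder: replace with a call to your own model/client.
    # Returning "" lets the script run end to end (the score will just be 0%).
    return ""


def gold_answer(answer_field: str) -> str:
    # GSM8K gold answers end with a line like "#### 42".
    return answer_field.split("####")[-1].strip().replace(",", "")


def last_number(text: str) -> str:
    # Take the last number in the model's reply as its final answer.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else ""


ds = load_dataset("gsm8k", "main", split="test").select(range(100))  # small sample

correct = sum(
    last_number(ask_model(row["question"])) == gold_answer(row["answer"]) for row in ds
)
print(f"exact match: {correct / len(ds):.1%}")
```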


